Abstract

We have carried out a human cDNA sequencing project to accumulate information regarding the coding
sequences of unidentified human genes. As an extension of the preceding reports, we herein present the
entire sequences of 150 cDNA clones of unknown human genes, named KIAA1294 to KIAA1443, from two
sets of size-fractionated human adult and fetal brain cDNA libraries. The average sizes of the inserts and
corresponding open reading frames of the cDNA clones analyzed here reached 4.8 kb and 2.7 kb (910 amino
acid residues), respectively. From sequence similarities and protein motifs, 73 predicted gene products
were functionally annotated and 97% of them were classified into the following four functional categories:
cell signaling/communication, nucleic acid management, cell structure/motility and protein management.

Additionally, the chromosomal loci of the genes were assigned by using human-rodent hybrid panels for 
those genes whose mapping data were not available in the public databases. The expression profiles of
the genes were also studied in 10 human tissues, 8 brain regions, spinal cord, fetal brain and fetal liver 
by reverse transcription-coupled polymerase chain reaction, products of which were quantified by enzyme- 
linked immunosorbent assay. 

Key words: large proteins; in vitro transcription/translation; cDNA sequencing; expression profile;
chromosomal location; brain



We have been making efforts to accumulate information on the coding
sequences of unidentified human genes. In particular, our recent interest
has focused on unidentified genes encoding large proteins in human brain,
since these gene products are likely to play important roles in the
central nervous system. To identify such genes, we constructed a set of
strictly size-fractionated cDNA libraries from human brain, and an in
vitro transcription/translation system has been applied to select the
cDNA clones coding for large proteins prior to the determination of their
entire sequences. As an alternative method for clone selection, we have
recently introduced a computer-based approach using GeneMark analysis for
picking up cDNA clones with a high probability of coding for protein.
This new approach is expected to minimize the risk of overlooking
important cDNA clones which fail to produce proteins in vitro.

The sequences of more than 1200 cDNA clones have been reported by our
project; the total length of the determined sequences exceeds 6.3 Mb, and
the average length of the gene products deduced from the brain cDNAs is
over 900 amino acid residues. As an extension of the preceding reports,
we herein report the coding sequence features of 150 new cDNA clones
which have the potential to code for large proteins in vitro. In addition
to the specific features of the newly predicted protein sequences
annotated by the database search, the expression profiles and the
chromosomal locations of these 150 new genes are also described. The
information regarding these newly identified genes should greatly
increase our understanding of the biological functions of human genes at
the molecular level.

1. Sequence Analysis and Prediction of Protein- 
Coding Regions in cDNA Clones 

cDNA clones to be entirely sequenced were selected according to the
following criteria: (1) the novelty of the single-pass sequences of both
cDNA ends; (2) their protein-coding potential. The latter criterion was
critical for conducting our cDNA project efficiently, because there are
many cDNA clones which apparently do not possess a protein-coding region
in the
Figure 1. Physical maps of cDNA clones analyzed. The physical maps shown here were constructed from the sequence data of respective
cDNA clones or, when necessary, from the combination of cDNA clones and RT-PCR products. The horizontal scale represents the
cDNA length in kb, and the gene numbers corresponding to respective cDNAs are given on the left. The ORFs and untranslated
regions are shown by solid and open boxes, respectively. The positions of the first ATG codons, with or without the contexts of
the Kozak's rule, are indicated by solid and open triangles, respectively. RepeatMasker, a program that screens DNA sequences for
interspersed repeats known to exist in mammalian genomes, was applied to detect repeat sequences in respective cDNA sequences
(Smit, A.F.A. and Green, P., RepeatMasker at http://ftp.genome.washington.edu/RM/RepeatMasker.html). Short interspersed
nucleotide elements (SINEs) including Alu and MIRs sequences and other repetitive sequences thus detected are represented by
dotted and hatched boxes, respectively.
Table 1. Information of sequence data and chromosomal locations of the identified genes.
a) Accession numbers of DDBJ, EMBL and GenBank databases. b) Values excluding poly(A) sequences. c) Values were
calculated from the number of amino acid residues between two termination codons in the case where the in-frame termination
codon exists upstream of the first ATG codon. d) Chromosome numbers were identified by using the GeneBridge 4 radiation
hybrid panel unless specified. The actual primer sequences and the PCR conditions used for the radiation hybrid mapping
are accessible through the World Wide Web at http://www.kazusa.or.jp/huge. The chromosomal locations highlighted by
asterisks were fetched from the UniGene database. The chromosomal locations highlighted by sharps were taken from the
GenBank database because the sequences of the cDNA clones could be found in the genomic sequences whose chromosome
numbers were assigned. e) cDNA and ORF lengths were revised by direct analysis of the RT-PCR products. f) Nucleotide
sequences were determined after subcloning of the internal Not I-digested fragment. Therefore, the cDNA lengths of these genes
represent those of the internal Not I-digested fragments. g) cDNA clones were selected by analysis of 5'-end single-pass sequences
using the GeneMark analysis.
Table 2. Functional classifications of the gene products.
2-1. Predicted function based on homology search
cDNA libraries derived from tissue poly(A)+ RNA. To screen cDNA clones
according to their protein-coding capability, we have used an in vitro
expression system and have recently introduced a computer-based method
called GeneMark analysis for minimizing the risk of overlooking
important cDNA clones. In this report, 21 cDNA clones were selected by
GeneMark analysis and 129 cDNA clones were selected by the in vitro
expression system. These cDNA clones were isolated from the
size-fractionated human adult brain cDNA libraries Nos. 2 to 5 (insert
sizes ranging from 4 to 6 kb) and the size-fractionated human fetal
brain cDNA libraries Nos. 4 and 6 (insert sizes ranging from 4 to 7 kb)
previously constructed. The clones with unidentified sequences at both
ends were chosen by single-pass sequencing, and a homology search was
performed against the GenBank database (release 113.0) excluding
expressed sequence tags and genomic sequences. A total of 35 cDNA clones
(KIAA1389-KIAA1402, KIAA1415-KIAA1422, KIAA1424, KIAA1425 and
KIAA1433-KIAA1443) were selected from the adult brain libraries, and the
remaining 115 cDNA clones were obtained from the fetal brain cDNA
libraries. Entire sequencing of these clones was performed according to
the methods previously described in detail. Twenty-three clones
(KIAA1403-KIAA1425) seemed to carry spurious coding interruptions caused
by errors of the reverse transcriptase or by retained intron sequences.
For these cases,
the sequences of the regions causing interruption of an 
open reading frame (ORF) were reexamined by direct se- 
Eukaryotic protein kinase domain
PH domain
RhoGAP domain
Cullin family
Domain found in Dishevelled, Egl-10, and Pleckstrin
PH domain
Phosphotyrosine interaction domain

RNA recognition motif
Zinc finger C-x8-C-x5-C-x3-H type
Myb-like DNA-binding domain
Myb-like DNA-binding domain
ELM2 domain
BTB/POZ domain
Kelch motif
Kelch motif
Kelch motif
Kelch motif
Kelch motif
Kelch motif
Myb-like DNA-binding domain
Zinc finger, C2H2 type
Zinc finger, C2H2 type
Zinc finger, C2H2 type
Zinc finger, C2H2 type
Zinc finger, C2H2 type

Calponin homology (CH) domain

3.30E-06  LIM domain containing proteins

PF00389

2.20E-01  HECT-domain
6.?0E-01  Ubiquitin carboxyl-terminal hydrolase family 2
4.10E-13  Ubiquitin carboxyl-terminal hydrolase family 2
9.10E-20  Ubiquitin carboxyl-terminal hydrolase family 2
?.40E-01  Ribosomal protein L11

3.50E-01  D-isomer specific 2-hydroxyacid dehydrogenases

a) Motif search was performed by HMMER2.1.1 against the Pfam database (release 4.4). b) Function was classified based on [...]



quencing of the major reverse transcription-coupled polymerase chain
reaction (RT-PCR) products to precisely predict protein-coding
sequences. This examination revealed spurious interruptions in the
following clones: ORFs in 7 clones (KIAA1403, KIAA1405, KIAA1409,
KIAA1410, KIAA1415, KIAA1424 and KIAA1425) were found to carry single or
multiple insertions, most of which probably corresponded to intronic
sequences; ORFs in 7 clones (KIAA1411, KIAA1412, KIAA1413, KIAA1416,
KIAA1418, KIAA1420 and KIAA1421) were frame-shifted by single or double
short insertions or a single deletion (< 5 nucleotide residues); ORFs in
4 clones (KIAA1404, KIAA1408, KIAA1417 and KIAA1423) were found to carry
single or double deletions; ORFs in 4 clones (KIAA1406, KIAA1407,
KIAA1414 and KIAA1422) were divided into several portions by a
combination of spurious interruptions including insertions/deletions;
and KIAA1419 carried a nonsense mutation in the ORF. For those genes,
the sequences revised by the RT-PCR experiments, not the actual cloned
cDNA sequences, were deposited in the GenBank/EMBL/DDBJ databases and
used for analyses in this study, including prediction of their
protein-coding sequences, unless otherwise stated. The results of the
comparison between the cloned DNA and the revised DNA sequences are
available through the World Wide Web site at
http://www.kazusa.or.jp/huge. The actual primer sequences and the PCR
conditions used for the RT-PCR experiments are accessible through the
web site at http://www.kazusa.or.jp/~hirosawa/interruption/entrance.html.
Notably, clones for eight genes (KIAA1297, KIAA1395, KIAA1398, KIAA1410,
Table 3. Homologues of the newly identified genes found in various databases.
a) The definition of homologues used here was the proteins found in the databases satisfying the following conditions: i)
the length ranged from 80% to 125% of the query sequence, ii) the ratio of the length of the aligned region to that of the
original sequence of the query was 80% or greater, iii) percent identity was 30% or greater. The method of homology
search was the same as that explained in Table 2-1. b) The following databases were used: HUGE, our cDNA-encoded
protein database (http://www.kazusa.or.jp/huge); yeast, non-redundant peptide database from genome-ftp.stanford.edu:
/pub/yeast/yeast_protein/yeast_nrpep.fasta.Z; C. elegans, protein database deduced from the C. elegans full genome sequence
(ftp.sanger.ac.uk:/pub/databases/C.elegans_sequences/C_elegans_proteins.1998-10-16.pep) and the entries derived from C.
elegans in OWL; and OWL (release 31.4). In the case of the database search against OWL, only the homologue with the highest
score to each query was listed. c) The number of amino acid residues of the gene product. d) The values mean the ratio of the
length of the aligned region to the original length of the query sequence, in percentage. e) For entries from the databases yeast and
OWL, the annotations were listed. For C. elegans, IDs of OWL were listed when sequences identical to the entries from the
full genome were registered in OWL.
T. Nagase et al.



KIAA1416, KIAA1420, KIAA1421 and KIAA1422)
seemed to lack regions encoding C-terminal portions
due to the presence of a Not I site in their coding regions,
because cDNAs were digested with Not I before
ligation into the vector. In contrast, clones for five genes
[...] of the sequences due to the presence of an internal Not
I site in their sequences. For these five genes, the nucleotide
sequences of only the region between two Not I sites were determined,
since their original clones were most likely to harbor two
intermolecularly ligated independent cDNAs. After these revisions, the
average size of the cDNA sequences became 4.8 kb and that of the ORFs
corresponded to approximately 910 amino acid residues. Physical maps of
the 150 cDNA sequences analyzed are shown in Fig. 1, where the ORFs and
the first ATG codons in the respective ORFs are indicated by solid boxes
and triangles, respectively. Repeat sequences are also shown in Fig. 1.
Comparison of the predicted protein-coding sequence for KIAA1299 with
those of its mouse and rat homologues suggests that this cDNA clone
encodes a complete protein, although it possesses an unusually long 5'
non-coding sequence extending more than 3 kb. Table 1 lists the lengths
of inserts, the ORF lengths and the chromosomal locations of the
respective clones. Chromosomal loci of 66 newly identified genes were
assigned using the human-rodent hybrid panel GeneBridge 4 (Research
Genetics Inc., USA), since their mapping data were not available in the
public databases. The chromosomal locations of 78 genes, which are
highlighted by asterisks in Table 1, were fetched from the UniGene
database (http://www.ncbi.nlm.nih.gov/UniGene). The chromosomal
locations of the remaining six genes, which are highlighted by sharps in
Table 1, were obtained from the GenBank database because the sequences
of the cDNA clones were found in genomic sequences whose chromosome
numbers had already been assigned.
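The ORF bookkeeping used in this section (the first ATG codon, in-frame termination codons, and ORF length in amino acid residues) can be illustrated with a short Python sketch. This is not the procedure used in the project; it is a minimal forward-strand scan written for illustration, and the function name `find_longest_orf` and the returned field names are assumptions introduced here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_longest_orf(seq):
    """Return the longest ATG-to-stop ORF on the forward strand, or None.

    The result is a dict with the reading frame (0-2), the 0-based offsets
    of the first ATG and of the termination codon, the ORF length in amino
    acid residues, and a flag telling whether an in-frame termination codon
    lies upstream of that ATG.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        atg = None             # first ATG seen since the last stop codon
        upstream_stop = False  # in-frame stop seen before the current ATG?
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if codon in STOP_CODONS:
                if atg is not None:
                    aa_length = (pos - atg) // 3
                    if best is None or aa_length > best["aa_length"]:
                        best = {
                            "frame": frame,
                            "atg": atg,
                            "stop": pos,
                            "aa_length": aa_length,
                            "upstream_stop": upstream_stop,
                        }
                atg = None
                upstream_stop = True
            elif codon == "ATG" and atg is None:
                atg = pos
    return best


# Toy sequence: stop codon, then ATG + three GCC codons, then a stop codon.
orf = find_longest_orf("TAAATGGCCGCCGCCTGAGG")
print(orf["aa_length"], orf["upstream_stop"])  # prints: 4 True
```

As a consistency check on the averages quoted above, 910 amino acid residues correspond to 910 x 3 = 2,730 coding nucleotides, in line with the average ORF size of about 2.7 kb.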




2. Functional Classification of Predicted Gene
Products

The gene products predicted from the cDNA sequences were classified by
homology and/or motif search against the following public databases: the
protein sequence database OWL (release 31.4); databases of predicted
protein sequences from the yeast and C. elegans genomes
(genome-ftp.stanford.edu:/pub/yeast/yeast_protein/yeast_nrpep.fasta.Z,
ftp.sanger.ac.uk:/pub/databases/C.elegans_sequences/C_elegans_proteins.1998-10-16.pep);
the protein domain database Pfam (release 4.4); and our own database,
HUGE (http://www.kazusa.or.jp/huge). As shown in Table 2, the 73 gene
products were classified into five functional categories. Among them,
53 gene products indicated significant sequence similarity to
functionally annotated proteins (Table 2-1). The functions of the other
20 gene products were predicted based on the presence of functional
motifs/domains, since they did not show sequence similarity to
functionally annotated proteins (Table 2-2). In total, 63 gene products
(86.3% of the genes functionally annotated here) were suggested to have
functions relating to cell signaling/communication, nucleic acid
management or cell structure/motility.



Among the gene products classified into nucleic acid management, 5 coded
for DNA-binding proteins carrying C2H2-type zinc finger domains. The
average number of these domains among these gene products was about 15.
Since the majority of zinc finger proteins in yeast contain only two
domains per polypeptide, the multiple appearance of C2H2-type zinc
finger domains in a single polypeptide might be a specific character of
large proteins in multicellular organisms. To find the genes conserved
in other species, we tentatively defined "homologues" as genes sharing
at least 30% protein sequence identity spanning almost the entire region
(more than 80% coverage of the query protein sequence). As shown in
Table 3, 48 KIAA gene products were found to have "homologues" in the
databases. Homologues of 9 of the 48 KIAA proteins were found in C.
elegans, and homologues of 3 (KIAA1347, KIAA1352 and KIAA1401) were
found in both yeast and C. elegans. KIAA1347 and KIAA1352 were similar
to Ca2+-transporting ATPase and leucyl-tRNA synthetase, respectively,
whereas KIAA1401 had no similarity to any functionally known genes.
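The tentative "homologue" definition above, together with the length condition given in the footnote to Table 3 (database protein length between 80% and 125% of the query), amounts to a simple three-part filter. The following Python sketch shows one way such a filter could be written; the `Hit` record and the function `is_homologue` are hypothetical names introduced for illustration and are not part of the original analysis pipeline.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One database match for a query protein (hypothetical record)."""
    query_length: int        # residues in the KIAA query protein
    subject_length: int      # residues in the database protein
    aligned_length: int      # residues of the query covered by the alignment
    percent_identity: float  # identity over the aligned region, in percent


def is_homologue(hit):
    """Apply the three conditions of the tentative homologue definition."""
    length_ratio = hit.subject_length / hit.query_length
    coverage = hit.aligned_length / hit.query_length
    return (
        0.80 <= length_ratio <= 1.25      # i) comparable overall length
        and coverage >= 0.80              # ii) alignment spans most of the query
        and hit.percent_identity >= 30.0  # iii) at least 30% identity
    )


# A 900-residue match aligned over 850 residues of a 1000-residue query
# at 42% identity satisfies all three conditions.
print(is_homologue(Hit(1000, 900, 850, 42.0)))  # prints: True
```

Requiring both the coverage and the length conditions excludes matches that share only a single conserved domain with a much larger or much smaller protein.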

3. Expression Profiles of Predicted Genes 

The expression profiles of the genes newly identified in this study are
shown in Fig. 2 by using color codes. KIAA1379 was homologous to rat
synaptic dynamin-associated protein I (syndapin I) and was predominantly
expressed in hippocampus. The gene expression levels of KIAA1341 and
KIAA1366, which were similar to mouse transcriptional suppressor of the
myelin basic protein gene and rat neuroligin 2, respectively, were
relatively high in all brain regions examined. KIAA1346 and KIAA1434
were predominantly expressed in spinal cord. KIAA1312, KIAA1315 and
KIAA1417 were expressed very poorly in all regions examined, but their
mRNAs were detected. These expression profiles also provide important
information for identifying biologically important genes among those
characterized in this project.
Figure 2. Expression profiles of 150 newly identified genes examined by RT-PCR ELISA. The tissue expression levels of the 150 human
genes were analyzed by using the RT-PCR ELISA according to methods previously described. Gene names are given as KIAA
numbers at the left side of each set of color codes. Tissue and brain region names are indicated above the top sets of color codes. A
color conversion panel shown at the bottom was used for displaying mRNA levels as color codes. The mRNA levels are expressed
in equivalent amounts (fg) of the authentic cDNA plasmids in 1 ng of starting poly(A)+ RNAs. Besides 10 tissues, 9 regions of the
adult central nervous system (amygdala, corpus callosum, cerebellum, caudate nucleus, hippocampus, substantia nigra, subthalamic
nucleus, thalamus, and spinal cord) and fetal brain were included in the expression profiling. As a control, mRNA levels in fetal liver
were also examined.